For this project I used the dataset from Inside.
You can also find the dataset here.
In this project I try to answer the following questions:
- Which variables have the greatest influence on the price of an Airbnb listing?
- What is the best time of year to rent an Airbnb in Buenos Aires?
- Plus other insights that come up as we explore the dataset.
pacotes <- c("plotly","tidyverse","ggrepel","fastDummies","knitr","kableExtra",
"splines","reshape2","PerformanceAnalytics","correlation","see",
"ggraph", "car", "olsrr", "jtools", "ggside", "ggplot2", "tidyquant", "DT")
options(rgl.debug = TRUE)
# Install whatever is missing, then load every package
instalador <- pacotes[!pacotes %in% rownames(installed.packages())]
if (length(instalador) > 0) {
  install.packages(instalador, dependencies = TRUE)
}
sapply(pacotes, require, character.only = TRUE)
## plotly tidyverse ggrepel
## TRUE TRUE TRUE
## fastDummies knitr kableExtra
## TRUE TRUE TRUE
## splines reshape2 PerformanceAnalytics
## TRUE TRUE TRUE
## correlation see ggraph
## TRUE TRUE TRUE
## car olsrr jtools
## TRUE TRUE TRUE
## ggside ggplot2 tidyquant
## TRUE TRUE TRUE
## DT
## TRUE
listing_df <- read_csv('data/listings.csv') # full Airbnb listings dataset for Buenos Aires
## Rows: 22713 Columns: 75
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (25): listing_url, source, name, description, neighborhood_overview, pi...
## dbl (37): id, scrape_id, host_id, host_listings_count, host_total_listings_...
## lgl (8): host_is_superhost, host_has_profile_pic, host_identity_verified, ...
## date (5): last_scraped, host_since, calendar_last_scraped, first_review, la...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Below are the 15 neighbourhoods with the most listings
listing_df %>%
group_by(neighbourhood_cleansed) %>%
summarise(qtd_bairros = n()) %>%
slice_max(qtd_bairros, n=15) %>%
mutate(neighbourhood_cleansed = reorder(neighbourhood_cleansed, -qtd_bairros)) %>%
ggplot(aes(x = neighbourhood_cleansed, y = qtd_bairros, fill=neighbourhood_cleansed)) +
theme(axis.text.x = element_text(angle = 90))+
geom_col()
If you travel to Buenos Aires, these are the neighbourhoods where you are most likely to find accommodation.
Average price per neighbourhood, showing the 5 most expensive
- To compute the average, the "price" column first needs some cleaning: remove the $ and the comma, then convert it to numeric.
listing_df$price <- str_replace_all(listing_df$price,'[$]','')
listing_df$price <- str_replace_all(listing_df$price,',','')
listing_df$price <- as.numeric(listing_df$price)
listing_df %>%
group_by(neighbourhood_cleansed) %>%
summarise(avg_price = mean(price)) %>%
slice_max(avg_price, n=5) %>%
kable() %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
font_size = 22)
| neighbourhood_cleansed | avg_price |
|---|---|
| Coghlan | 260997.61 |
| Puerto Madero | 25266.96 |
| Barracas | 23476.08 |
| San Telmo | 20736.60 |
| Monte Castro | 16837.25 |
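The clean-up applied to `price` above can be wrapped in a small helper so the same steps are reused for other files. This is only a sketch: `clean_price` is a hypothetical name that does not appear in the original analysis, and it uses base R's `gsub` rather than `str_replace_all`.

```r
# Hypothetical helper (not part of the original notebook): strips "$" and
# thousands separators from price strings and converts them to numeric.
clean_price <- function(x) {
  as.numeric(gsub("[$,]", "", x))
}

# Made-up example strings in the same "$1,250.00" format as the raw column
raw_prices <- c("$1,250.00", "$980.00", NA)
clean_price(raw_prices)
```

Missing values stay missing, so they still have to be handled before calling `mean()` or `median()`.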
Looking at the available room types (room_type), what is the average price for each of them?
listing_df %>%
group_by(room_type) %>%
summarise(avg_price = mean(price)) %>%
kable() %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
font_size = 22)
| room_type | avg_price |
|---|---|
| Entire home/apt | 15681.38 |
| Hotel room | 51544.88 |
| Private room | 11035.68 |
| Shared room | 24706.32 |
As shown above, hotel rooms have the highest average price in Buenos Aires. However, it is a bit odd that a private room is cheaper on average than a shared room, don't you think? Let's investigate.
Let's look at how prices are distributed for each room type; perhaps a few outliers are distorting the category averages.
ggplotly(
ggplot(listing_df, aes(x = room_type, y = price)) +
geom_point(color = "#39568CFF", size = 2.5) +
labs(x = "Type_room", y = "Price") +
theme_classic()
)
BINGO! Since the mean is being pulled up by outliers, I will look at the median instead, which is less sensitive to outliers and should give a more representative value.
listing_df %>%
group_by(room_type) %>%
summarise(median_price = median(price)) %>%
kable() %>%
kable_styling(bootstrap_options = "striped",
full_width = F,
font_size = 22)
| room_type | median_price |
|---|---|
| Entire home/apt | 9318 |
| Hotel room | 8142 |
| Private room | 4763 |
| Shared room | 4000 |
Now we can conclude that shared rooms are the cheapest and that, by the median, entire homes/apartments (9318) cost more than hotel rooms (8142). Everything makes more sense now, doesn't it?
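To see why the two summaries disagree so much, here is a tiny illustration with made-up prices (not taken from the dataset): a single extreme listing is enough to drag the mean far from the typical value, while the median barely moves.

```r
# Made-up prices for one room type: four typical listings and one extreme one
toy_prices <- c(4000, 4200, 3900, 4100, 500000)

mean(toy_prices)    # heavily influenced by the single extreme value
median(toy_prices)  # stays close to the typical listing
```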
First, we need to make a few adjustments to the bathrooms_text column.
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'bath','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'s','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'S','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'private','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'Private','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'hared','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'half-','')
listing_df$bathrooms_text <- str_replace_all(listing_df$bathrooms_text,'Half-','')
listing_df$bathrooms_text <- as.numeric(listing_df$bathrooms_text)
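The chain of replacements above could also be written as a single extraction of the numeric part. The sketch below is an alternative under stated assumptions, not the original method: `parse_bathrooms` is a hypothetical helper, and it treats entries without digits that mention "half" (for example "Half-bath") as 0.5, whereas the original clean-up turns them into NA.

```r
# Hypothetical helper: pulls the first number out of strings such as
# "1.5 shared baths". Assumption: entries without digits that mention
# "half" are counted as 0.5 bathrooms (the original code yields NA there).
parse_bathrooms <- function(x) {
  pos <- regexpr("[0-9]+(\\.[0-9]+)?", x)   # position of the first number
  has_num <- !is.na(pos) & pos > 0
  out <- rep(NA_real_, length(x))
  out[has_num] <- as.numeric(regmatches(x, pos))
  is_half <- !has_num & grepl("half", x, ignore.case = TRUE)
  out[is_half] <- 0.5
  out
}

parse_bathrooms(c("1 bath", "1.5 shared baths", "2 private baths", "Half-bath", NA))
```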
Correlation between price, number of beds, number of bedrooms and number of bathrooms.
chart.Correlation((listing_df[,c(41,37,38,39)]), histogram = TRUE)
It appears that price is associated with the number of beds, bedrooms and bathrooms in a listing; however, even though the correlations are statistically significant, they are not very strong.
Another interesting question is:
- Do listing prices change significantly over the year?
- To answer this, we will use the calendar_df dataset.
calendar_df <- read.csv("data/calendar.csv") # price of each listing over a one-year period
The date and price variables need some adjustments before we can run the analysis.
calendar_df$date <- as.Date(calendar_df$date)
calendar_df['month'] <- (format(calendar_df$date, '%Y-%m'))
calendar_df$month <- as.factor(calendar_df$month)
calendar_df$price <- str_replace_all(calendar_df$price,'[$]','')
calendar_df$price <- str_replace_all(calendar_df$price,',','')
calendar_df$price <- as.numeric(calendar_df$price)
Month x median price
price_month <- calendar_df %>%
group_by(month) %>%
summarise(median_price = median(price)) %>%
ggplot(aes(x = month, y = median_price, group=1)) +
geom_line(color='grey') +
geom_point() +
guides(x = guide_axis(angle = 90)) +
labs(x= 'Month', y= 'Median Price',
title = 'Price per month') +
theme_classic()
price_month
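One detail worth keeping in mind for the monthly aggregation: `median(price)` returns NA for a month if any price in that month is missing. The sketch below uses a made-up calendar (not the real data) and base R's `aggregate`, whose formula interface drops rows with missing values by default; whether the real `calendar_df` contains missing prices is an assumption here.

```r
# Made-up calendar with one missing price
toy_calendar <- data.frame(
  date  = as.Date(c("2023-01-05", "2023-01-20", "2023-02-03",
                    "2023-02-14", "2023-02-28")),
  price = c(100, 300, 200, NA, 400)
)

# Same year-month key as in the analysis above
toy_calendar$month <- format(toy_calendar$date, "%Y-%m")

# The formula interface uses na.action = na.omit, so the NA row is ignored
monthly_median <- aggregate(price ~ month, data = toy_calendar, FUN = median)
monthly_median
```

Inside a `summarise()` call the equivalent would be `median(price, na.rm = TRUE)`.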
Dropping empty columns and columns with URLs, location, comments and host names, since they will not be used in this analysis.
listing_df <- subset(listing_df, select = -c(id, listing_url, scrape_id, picture_url,host_id, host_url,
host_thumbnail_url, host_picture_url, neighbourhood_group_cleansed,
review_scores_value, calendar_updated, license, bathrooms,neighbourhood,
neighborhood_overview, host_neighbourhood, host_location, host_response_rate,
host_about,description, name, host_name, first_review, last_review))
listing_df$beds[is.na(listing_df$beds)] <- 1
listing_df$bedrooms[is.na(listing_df$bedrooms)] <- 1
listing_df$bathrooms_text[is.na(listing_df$bathrooms_text)] <- 1
listing_df$number_of_reviews[is.na(listing_df$number_of_reviews)] <- 0
listing_df$reviews_per_month[is.na(listing_df$reviews_per_month)] <- 0
# checking which variables in the dataset are logical
(to.replace <- names(which(sapply(listing_df, is.logical))))
## [1] "host_is_superhost" "host_has_profile_pic" "host_identity_verified"
## [4] "has_availability" "instant_bookable"
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:xts':
##
## first, last
## The following objects are masked from 'package:reshape2':
##
## dcast, melt
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
Cols <- which(sapply(listing_df, is.logical))
setDT(listing_df)
for(j in Cols){
set(listing_df, i=NULL, j=j, value= as.numeric(listing_df[[j]]))
}
listing_df$source <- as.factor(listing_df$source)
listing_df$property_type <- as.factor(listing_df$property_type)
listing_df$host_response_time[listing_df$host_response_time == 'N/A'] <- "did not inform"
listing_df$host_response_time <- as.factor(listing_df$host_response_time)
listing_df$neighbourhood_cleansed <- as.factor(listing_df$neighbourhood_cleansed)
listing_df$room_type <- as.factor(listing_df$room_type)
listing_df$host_verifications[listing_df$host_verifications == '[]'] <- 1
listing_df$host_verifications <- as.factor(listing_df$host_verifications)
listing_df$host_verifications <- droplevels(listing_df$host_verifications, exclude = 1)
listing_df$amenities <- lengths(gregexpr(",", listing_df$amenities)) + 1L
listing_df$host_acceptance_rate <- str_remove_all(listing_df$host_acceptance_rate, '[%]')
listing_df$host_acceptance_rate <- as.numeric(listing_df$host_acceptance_rate)
After a brief analysis, I found that the variables below are not relevant, so we will drop them.
listing_df <- subset(listing_df, select = -c(last_scraped, calendar_last_scraped))
Function to identify outliers using the interquartile range (IQR) method
quartil <- function(column){
q1 <- quantile(column, 0.25, na.rm = TRUE) # 1st quartile
q3 <- quantile(column, 0.75, na.rm = TRUE) # 3rd quartile
iq <- q3 - q1 # interquartile range
lim_sup <- q3 + 1.5*iq # upper limit
return(lim_sup)
}
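As a worked example of the rule used by the function above, the sketch below computes both fences on a small made-up vector. `iqr_limits` is a hypothetical helper; the original function only returns the upper limit.

```r
# Hypothetical helper: lower and upper fences of the 1.5 * IQR rule
iqr_limits <- function(x) {
  q  <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iq <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iq, upper = q[2] + 1.5 * iq)
}

# Made-up values: Q1 = 1.75, Q3 = 3.25, so IQR = 1.5
x <- c(1, 1, 2, 2, 3, 3, 4, 20)
limits <- iqr_limits(x)
limits                  # lower = -0.5, upper = 5.5
x[x > limits["upper"]]  # only the value 20 is flagged
```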
Applying the function
max_beds<- quartil(listing_df$beds)
max_bedrooms <- quartil(listing_df$bedrooms)
max_bathrooms <- quartil(listing_df$bathrooms_text)
max_price <- quartil(listing_df$price)
Upper limits computed for each variable
print(paste("beds:",max_beds, "bedrooms:", max_bedrooms, "bathrooms:", max_bathrooms, "price:", max_price))
## [1] "beds: 3.5 bedrooms: 1 bathrooms: 2.25 price: 24068"
Now I will replace any value above the upper limit estimated for each variable (with the column mean, or with 1 in the case of bedrooms).
Replacing outliers in the columns
for (i in seq_along(listing_df$beds)){
if (listing_df$beds[i] > 3.5){
listing_df$beds[i] <- mean(listing_df$beds)
}
}
for (i in seq_along(listing_df$bedrooms)){
if (listing_df$bedrooms[i] > 1){
listing_df$bedrooms[i] <- 1
}
}
for (i in seq_along(listing_df$bathrooms_text)){
if (listing_df$bathrooms_text[i] > 2.25){
listing_df$bathrooms_text[i] <- mean(listing_df$bathrooms_text)
}
}
for (i in seq_along(listing_df$price)){
if (listing_df$price[i] > 24068){
listing_df$price[i] <- mean(listing_df$price)
}
}
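The loops above can also be expressed with vectorised indexing. This is a sketch, not a drop-in replacement: `replace_above` is a hypothetical helper, and it computes the mean once before any replacement, whereas the loops recompute the mean after every replaced value, so results on the real data may differ slightly.

```r
# Hypothetical helper: replaces every value above `limit` with the mean
# of the original vector (computed once, outliers included).
replace_above <- function(x, limit) {
  original_mean <- mean(x, na.rm = TRUE)
  x[!is.na(x) & x > limit] <- original_mean
  x
}

# Made-up bed counts with one outlier; the mean of the vector is 4
toy_beds <- c(1, 2, 2, 3, 12)
replace_above(toy_beds, limit = 3.5)
```

Note that a mean computed with the outliers still in place can itself be above the limit, as in this toy case; using the median or capping at the limit are common alternatives.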
Notice how the outlying values are gone.
boxplot(listing_df$bedrooms)
boxplot(listing_df$beds)
boxplot(listing_df$bathrooms_text)
boxplot(listing_df$price)
boxplot(listing_df$host_acceptance_rate)
Let's check whether there are still many NA values in our dataset
sapply(listing_df, function(x) sum(is.na(x)))
## source
## 0
## host_since
## 0
## host_response_time
## 0
## host_acceptance_rate
## 2105
## host_is_superhost
## 0
## host_listings_count
## 0
## host_total_listings_count
## 0
## host_verifications
## 43
## host_has_profile_pic
## 0
## host_identity_verified
## 0
## neighbourhood_cleansed
## 0
## latitude
## 0
## longitude
## 0
## property_type
## 0
## room_type
## 0
## accommodates
## 0
## bathrooms_text
## 0
## bedrooms
## 0
## beds
## 0
## amenities
## 0
## price
## 0
## minimum_nights
## 0
## maximum_nights
## 0
## minimum_minimum_nights
## 0
## maximum_minimum_nights
## 0
## minimum_maximum_nights
## 0
## maximum_maximum_nights
## 0
## minimum_nights_avg_ntm
## 0
## maximum_nights_avg_ntm
## 0
## has_availability
## 0
## availability_30
## 0
## availability_60
## 0
## availability_90
## 0
## availability_365
## 0
## number_of_reviews
## 0
## number_of_reviews_ltm
## 0
## number_of_reviews_l30d
## 0
## review_scores_rating
## 4122
## review_scores_accuracy
## 4202
## review_scores_cleanliness
## 4202
## review_scores_checkin
## 4202
## review_scores_communication
## 4201
## review_scores_location
## 4201
## instant_bookable
## 0
## calculated_host_listings_count
## 0
## calculated_host_listings_count_entire_homes
## 0
## calculated_host_listings_count_private_rooms
## 0
## calculated_host_listings_count_shared_rooms
## 0
## reviews_per_month
## 0
Let's deal with the remaining missing values
listing_df$host_acceptance_rate[is.na(listing_df$host_acceptance_rate)] <- 77.077
listing_df <- listing_df[!is.na(listing_df$host_verifications),]
listing_df <- subset(listing_df, select = -c(review_scores_accuracy,review_scores_communication,review_scores_cleanliness, review_scores_location,review_scores_rating,review_scores_checkin))
The review score variables were dropped because they have a very large number of NAs, so keeping them could strongly affect the final model.
listing_df <- subset(listing_df, select = -c(source, host_since, host_response_time, host_verifications, neighbourhood_cleansed,property_type))
listing_df_1_dummies <- dummy_columns(.data = listing_df,
select_columns = c("room_type"),
remove_selected_columns = T,
remove_most_frequent_dummy = T)
summary(listing_df_1_dummies)
## host_acceptance_rate host_is_superhost host_listings_count
## Min. : 0.00 Min. :0.0000 Min. : 1.00
## 1st Qu.: 77.08 1st Qu.:0.0000 1st Qu.: 1.00
## Median : 96.00 Median :0.0000 Median : 3.00
## Mean : 84.21 Mean :0.3266 Mean : 17.58
## 3rd Qu.:100.00 3rd Qu.:1.0000 3rd Qu.: 13.00
## Max. :100.00 Max. :1.0000 Max. :1787.00
## host_total_listings_count host_has_profile_pic host_identity_verified
## Min. : 1.00 Min. :0.000 Min. :0.0000
## 1st Qu.: 1.00 1st Qu.:1.000 1st Qu.:1.0000
## Median : 3.00 Median :1.000 Median :1.0000
## Mean : 25.01 Mean :0.982 Mean :0.8727
## 3rd Qu.: 17.00 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :3160.00 Max. :1.000 Max. :1.0000
## latitude longitude accommodates bathrooms_text bedrooms
## Min. :-34.69 Min. :-58.53 Min. : 1.000 Min. :0.000 Min. :1
## 1st Qu.:-34.60 1st Qu.:-58.44 1st Qu.: 2.000 1st Qu.:1.000 1st Qu.:1
## Median :-34.59 Median :-58.42 Median : 2.000 Median :1.000 Median :1
## Mean :-34.59 Mean :-58.42 Mean : 2.874 Mean :1.155 Mean :1
## 3rd Qu.:-34.58 3rd Qu.:-58.39 3rd Qu.: 4.000 3rd Qu.:1.179 3rd Qu.:1
## Max. :-34.53 Max. :-58.36 Max. :16.000 Max. :2.000 Max. :1
## beds amenities price minimum_nights
## Min. :1.000 Min. : 2.00 Min. : 175 Min. : 1.000
## 1st Qu.:1.000 1st Qu.: 18.00 1st Qu.: 6388 1st Qu.: 2.000
## Median :1.000 Median : 30.00 Median : 8969 Median : 3.000
## Mean :1.599 Mean : 30.37 Mean : 9765 Mean : 6.826
## 3rd Qu.:2.000 3rd Qu.: 42.00 3rd Qu.:12286 3rd Qu.: 5.000
## Max. :3.000 Max. :103.00 Max. :24050 Max. :1000.000
## maximum_nights minimum_minimum_nights maximum_minimum_nights
## Min. : 1.0 Min. : 1.000 Min. : 1.00
## 1st Qu.: 90.0 1st Qu.: 2.000 1st Qu.: 2.00
## Median : 365.0 Median : 3.000 Median : 3.00
## Mean : 531.8 Mean : 6.522 Mean : 6.91
## 3rd Qu.: 1125.0 3rd Qu.: 4.000 3rd Qu.: 5.00
## Max. :99999.0 Max. :1000.000 Max. :1000.00
## minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm
## Min. :1.000e+00 Min. :1.000e+00 Min. : 1.000
## 1st Qu.:3.650e+02 1st Qu.:3.650e+02 1st Qu.: 2.000
## Median :1.125e+03 Median :1.125e+03 Median : 3.000
## Mean :6.638e+05 Mean :6.638e+05 Mean : 6.753
## 3rd Qu.:1.125e+03 3rd Qu.:1.125e+03 3rd Qu.: 5.000
## Max. :2.147e+09 Max. :2.147e+09 Max. :1000.000
## maximum_nights_avg_ntm has_availability availability_30 availability_60
## Min. :1.000e+00 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.:3.650e+02 1st Qu.:1.0000 1st Qu.: 0.00 1st Qu.:13.00
## Median :1.125e+03 Median :1.0000 Median :12.00 Median :37.00
## Mean :6.638e+05 Mean :0.9829 Mean :12.41 Mean :32.69
## 3rd Qu.:1.125e+03 3rd Qu.:1.0000 3rd Qu.:22.00 3rd Qu.:51.00
## Max. :2.147e+09 Max. :1.0000 Max. :30.00 Max. :60.00
## availability_90 availability_365 number_of_reviews number_of_reviews_ltm
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:33.00 1st Qu.: 89.0 1st Qu.: 1.00 1st Qu.: 0.000
## Median :65.00 Median :247.0 Median : 8.00 Median : 4.000
## Mean :55.56 Mean :219.9 Mean : 22.14 Mean : 9.303
## 3rd Qu.:81.00 3rd Qu.:344.0 3rd Qu.: 26.00 3rd Qu.: 13.000
## Max. :90.00 Max. :365.0 Max. :637.00 Max. :252.000
## number_of_reviews_l30d instant_bookable calculated_host_listings_count
## Min. : 0.0000 Min. :0.0000 Min. : 1.00
## 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 1.00
## Median : 0.0000 Median :0.0000 Median : 2.00
## Mean : 0.9931 Mean :0.2905 Mean : 14.09
## 3rd Qu.: 1.0000 3rd Qu.:1.0000 3rd Qu.: 10.00
## Max. :44.0000 Max. :1.0000 Max. :150.00
## calculated_host_listings_count_entire_homes
## Min. : 0.00
## 1st Qu.: 1.00
## Median : 2.00
## Mean : 13.27
## 3rd Qu.: 9.00
## Max. :150.00
## calculated_host_listings_count_private_rooms
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 0.624
## 3rd Qu.: 0.000
## Max. :29.000
## calculated_host_listings_count_shared_rooms reviews_per_month
## Min. : 0.00000 Min. : 0.000
## 1st Qu.: 0.00000 1st Qu.: 0.100
## Median : 0.00000 Median : 0.670
## Mean : 0.05748 Mean : 1.102
## 3rd Qu.: 0.00000 3rd Qu.: 1.650
## Max. :16.00000 Max. :19.940
## room_type_Hotel room room_type_Private room room_type_Shared room
## Min. :0.000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.000000 Median :0.00000 Median :0.000000
## Mean :0.004455 Mean :0.09479 Mean :0.007411
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.00000 Max. :1.000000
summary(modelo_listing)
##
## Call:
## lm(formula = price ~ ., data = listing_df_1_dummies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11373.6 -2589.0 -748.9 1802.8 17401.8
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 2.799e+06 1.070e+05 26.163
## host_acceptance_rate -6.904e+00 1.185e+00 -5.826
## host_is_superhost 5.075e+02 6.020e+01 8.431
## host_listings_count -1.204e+01 2.374e+00 -5.071
## host_total_listings_count 9.299e+00 1.315e+00 7.071
## host_has_profile_pic -3.246e+02 1.933e+02 -1.679
## host_identity_verified -8.115e+01 8.093e+01 -1.003
## latitude 4.485e+04 1.734e+03 25.872
## longitude 2.129e+04 1.041e+03 20.449
## accommodates 5.207e+02 1.978e+01 26.325
## bathrooms_text 3.189e+03 8.749e+01 36.448
## bedrooms NA NA NA
## beds 5.219e+02 4.051e+01 12.883
## amenities 3.264e+01 1.838e+00 17.760
## minimum_nights 7.079e+00 2.555e+00 2.771
## maximum_nights 2.390e-02 3.150e-02 0.759
## minimum_minimum_nights 1.243e+01 1.318e+01 0.943
## maximum_minimum_nights 2.577e+01 1.493e+01 1.726
## minimum_maximum_nights 2.403e-01 3.759e-01 0.639
## maximum_maximum_nights 1.711e+00 7.290e-01 2.348
## minimum_nights_avg_ntm -5.218e+01 2.330e+01 -2.240
## maximum_nights_avg_ntm -1.952e+00 8.934e-01 -2.184
## has_availability -1.448e+03 2.007e+02 -7.214
## availability_30 6.864e+01 6.264e+00 10.959
## availability_60 -4.802e+00 6.667e+00 -0.720
## availability_90 1.643e+00 3.547e+00 0.463
## availability_365 1.761e+00 2.365e-01 7.447
## number_of_reviews -7.715e-02 8.711e-01 -0.089
## number_of_reviews_ltm -4.908e+00 3.312e+00 -1.482
## number_of_reviews_l30d -7.401e+01 2.385e+01 -3.103
## instant_bookable 2.557e+02 5.992e+01 4.266
## calculated_host_listings_count -1.449e+02 1.720e+01 -8.420
## calculated_host_listings_count_entire_homes 1.591e+02 1.722e+01 9.242
## calculated_host_listings_count_private_rooms 8.776e+01 2.084e+01 4.212
## calculated_host_listings_count_shared_rooms -4.866e+00 4.996e+01 -0.097
## reviews_per_month -3.523e+02 3.165e+01 -11.133
## `room_type_Hotel room` 1.515e+03 5.074e+02 2.986
## `room_type_Private room` -3.013e+03 1.020e+02 -29.527
## `room_type_Shared room` -3.657e+03 3.608e+02 -10.136
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## host_acceptance_rate 5.75e-09 ***
## host_is_superhost < 2e-16 ***
## host_listings_count 3.99e-07 ***
## host_total_listings_count 1.58e-12 ***
## host_has_profile_pic 0.09313 .
## host_identity_verified 0.31599
## latitude < 2e-16 ***
## longitude < 2e-16 ***
## accommodates < 2e-16 ***
## bathrooms_text < 2e-16 ***
## bedrooms NA
## beds < 2e-16 ***
## amenities < 2e-16 ***
## minimum_nights 0.00560 **
## maximum_nights 0.44812
## minimum_minimum_nights 0.34579
## maximum_minimum_nights 0.08439 .
## minimum_maximum_nights 0.52267
## maximum_maximum_nights 0.01890 *
## minimum_nights_avg_ntm 0.02511 *
## maximum_nights_avg_ntm 0.02894 *
## has_availability 5.61e-13 ***
## availability_30 < 2e-16 ***
## availability_60 0.47135
## availability_90 0.64323
## availability_365 9.90e-14 ***
## number_of_reviews 0.92943
## number_of_reviews_ltm 0.13841
## number_of_reviews_l30d 0.00192 **
## instant_bookable 1.99e-05 ***
## calculated_host_listings_count < 2e-16 ***
## calculated_host_listings_count_entire_homes < 2e-16 ***
## calculated_host_listings_count_private_rooms 2.54e-05 ***
## calculated_host_listings_count_shared_rooms 0.92240
## reviews_per_month < 2e-16 ***
## `room_type_Hotel room` 0.00283 **
## `room_type_Private room` < 2e-16 ***
## `room_type_Shared room` < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3797 on 22632 degrees of freedom
## Multiple R-squared: 0.298, Adjusted R-squared: 0.2969
## F-statistic: 259.7 on 37 and 22632 DF, p-value: < 2.2e-16
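The line "1 not defined because of singularities" and the NA row for `bedrooms` follow from the earlier outlier step: the summary of `listing_df_1_dummies` shows `bedrooms` with minimum and maximum both equal to 1, so the column is constant and cannot be separated from the intercept. The toy example below (made-up numbers, same formula shape as the `Call` shown above) reproduces that behaviour.

```r
# Made-up data: `bedrooms` is constant, like the real column after capping
toy <- data.frame(
  price    = c(10, 12, 15, 11, 18, 20),
  beds     = c(1, 1, 2, 1, 3, 3),
  bedrooms = 1
)

fit <- lm(price ~ ., data = toy)
coef(fit)  # the coefficient for bedrooms is NA
```

Dropping the constant column before fitting would remove the NA row without changing the other estimates.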
summary(step_modelo_listing)
##
## Call:
## lm(formula = price ~ host_acceptance_rate + host_is_superhost +
## host_listings_count + host_total_listings_count + latitude +
## longitude + accommodates + bathrooms_text + beds + amenities +
## minimum_nights + maximum_maximum_nights + minimum_nights_avg_ntm +
## maximum_nights_avg_ntm + has_availability + availability_30 +
## availability_365 + number_of_reviews_l30d + instant_bookable +
## calculated_host_listings_count + calculated_host_listings_count_entire_homes +
## calculated_host_listings_count_private_rooms + reviews_per_month +
## `room_type_Hotel room` + `room_type_Private room` + `room_type_Shared room`,
## data = listing_df_1_dummies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11403.1 -2588.3 -754.7 1802.9 17733.2
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2.783e+06 1.067e+05 26.076
## host_acceptance_rate -7.174e+00 1.175e+00 -6.108
## host_is_superhost 4.732e+02 5.848e+01 8.092
## host_listings_count -1.202e+01 2.372e+00 -5.066
## host_total_listings_count 9.279e+00 1.314e+00 7.061
## latitude 4.464e+04 1.732e+03 25.773
## longitude 2.114e+04 1.037e+03 20.390
## accommodates 5.203e+02 1.977e+01 26.319
## bathrooms_text 3.190e+03 8.742e+01 36.494
## beds 5.229e+02 4.050e+01 12.913
## amenities 3.195e+01 1.814e+00 17.611
## minimum_nights 7.019e+00 2.552e+00 2.751
## maximum_maximum_nights 1.636e+00 7.080e-01 2.311
## minimum_nights_avg_ntm -1.388e+01 2.736e+00 -5.074
## maximum_nights_avg_ntm -1.636e+00 7.080e-01 -2.311
## has_availability -1.491e+03 1.986e+02 -7.508
## availability_30 6.425e+01 2.584e+00 24.866
## availability_365 1.776e+00 2.151e-01 8.256
## number_of_reviews_l30d -8.658e+01 2.276e+01 -3.803
## instant_bookable 2.553e+02 5.969e+01 4.278
## calculated_host_listings_count -1.464e+02 1.532e+01 -9.555
## calculated_host_listings_count_entire_homes 1.604e+02 1.533e+01 10.468
## calculated_host_listings_count_private_rooms 8.827e+01 1.944e+01 4.541
## reviews_per_month -3.728e+02 2.985e+01 -12.488
## `room_type_Hotel room` 1.528e+03 4.990e+02 3.061
## `room_type_Private room` -3.010e+03 1.017e+02 -29.586
## `room_type_Shared room` -3.670e+03 3.092e+02 -11.869
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## host_acceptance_rate 1.03e-09 ***
## host_is_superhost 6.15e-16 ***
## host_listings_count 4.10e-07 ***
## host_total_listings_count 1.70e-12 ***
## latitude < 2e-16 ***
## longitude < 2e-16 ***
## accommodates < 2e-16 ***
## bathrooms_text < 2e-16 ***
## beds < 2e-16 ***
## amenities < 2e-16 ***
## minimum_nights 0.005952 **
## maximum_maximum_nights 0.020840 *
## minimum_nights_avg_ntm 3.93e-07 ***
## maximum_nights_avg_ntm 0.020840 *
## has_availability 6.25e-14 ***
## availability_30 < 2e-16 ***
## availability_365 < 2e-16 ***
## number_of_reviews_l30d 0.000143 ***
## instant_bookable 1.89e-05 ***
## calculated_host_listings_count < 2e-16 ***
## calculated_host_listings_count_entire_homes < 2e-16 ***
## calculated_host_listings_count_private_rooms 5.62e-06 ***
## reviews_per_month < 2e-16 ***
## `room_type_Hotel room` 0.002205 **
## `room_type_Private room` < 2e-16 ***
## `room_type_Shared room` < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3797 on 22643 degrees of freedom
## Multiple R-squared: 0.2976, Adjusted R-squared: 0.2968
## F-statistic: 369 on 26 and 22643 DF, p-value: < 2.2e-16
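The call that produced `step_modelo_listing` is not shown above. A minimal sketch of how a stepwise reduction of this kind is usually run with base R's `step()` follows, on made-up data. The penalty `k = qchisq(0.05, df = 1, lower.tail = FALSE)` (about 3.84) is an assumption; the value used in the original analysis is not given.

```r
set.seed(42)

# Made-up data: y depends on x1 only, x2 is pure noise
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 + 3 * toy$x1 + rnorm(n)

full_model <- lm(y ~ ., data = toy)

# Backward elimination; k is the penalty per parameter (assumed value)
reduced_model <- step(full_model,
                      k = qchisq(0.05, df = 1, lower.tail = FALSE),
                      trace = 0)
formula(reduced_model)
```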
Kernel density estimation (KDE)
listing_df_1_dummies %>%
ggplot() +
geom_density(aes(x = step_modelo_listing$residuals), fill = "#55C667FF") +
labs(x = "Stepwise model residuals",
y = "Density") +
theme_bw()
### Test of residual normality
sf_teste <- function (x)
{
DNAME <- deparse(substitute(x))
x <- sort(x[complete.cases(x)])
n <- length(x)
if ((n < 5 || n > 25000))
stop("sample size must be between 5 and 25000")
y <- qnorm(ppoints(n, a = 3/8))
W <- cor(x, y)^2
u <- log(n)
v <- log(u)
mu <- -1.2725 + 1.0521 * (v - u)
sig <- 1.0308 - 0.26758 * (v + 2/u)
z <- (log(1 - W) - mu)/sig
pval <- pnorm(z, lower.tail = FALSE)
RVAL <- list(statistic = c(W = W), p.value = pval, method = "Shapiro-Francia normality test",
data.name = DNAME)
class(RVAL) <- "htest"
return(RVAL)
}
sf_teste(step_modelo_listing$residuals)
##
## Shapiro-Francia normality test
##
## data: step_modelo_listing$residuals
## W = 0.94135, p-value < 2.2e-16
listing_df_1_dummies %>%
  mutate(residuos = step_modelo_listing$residuals) %>%
  ggplot(aes(x = residuos)) +
  geom_histogram(aes(y = after_stat(density)),
                 color = "white",
                 fill = "#440154FF",
                 bins = 30,
                 alpha = 0.6) +
  stat_function(fun = dnorm,
                args = list(mean = mean(step_modelo_listing$residuals),
                            sd = sd(step_modelo_listing$residuals)),
                linewidth = 2, color = "grey30") +
  scale_color_manual(values = "grey50") +
  labs(x = "Residuals",
       y = "Density") +
  theme_bw()
The Shapiro-Francia test confirmed that the residuals do not follow a normal distribution. Given that, I will apply a Box-Cox transformation to the dependent variable and fit a new model.
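For reference, the Box-Cox transformation maps a positive value y to (y^lambda - 1) / lambda, or to log(y) when lambda is 0. The sketch below shows the transformation and its inverse, which is needed to bring fitted values back to the price scale. `box_cox` and `inv_box_cox` are hypothetical helper names; the lambda is the estimate reported by `powerTransform` in this section, and the three prices are the minimum, median and maximum from the summary table above.

```r
# Hypothetical helpers for the Box-Cox transformation and its inverse
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

inv_box_cox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

lambda <- 0.339769               # estimate reported in this section
prices <- c(175, 8969, 24050)    # min, median and max price from the summary

z <- box_cox(prices, lambda)
inv_box_cox(z, lambda)           # recovers the original prices
```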
lambda_BC <- powerTransform(listing_df_1_dummies$price)
lambda_BC
## Estimated transformation parameter
## listing_df_1_dummies$price
## 0.339769
listing_df_1_dummies$bcprice <- (((listing_df_1_dummies$price ^ lambda_BC$lambda) - 1) /
lambda_BC$lambda)
modelo_listing_bc <- lm(formula = bcprice ~ . -price, na.rm = T,
data = listing_df_1_dummies)
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
## extra argument 'na.rm' will be disregarded
summary(modelo_listing_bc)
##
## Call:
## lm(formula = bcprice ~ . - price, data = listing_df_1_dummies,
## na.rm = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.678 -5.860 -0.882 5.102 39.866
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 6.934e+03 2.416e+02 28.702
## host_acceptance_rate -1.079e-02 2.675e-03 -4.035
## host_is_superhost 1.297e+00 1.359e-01 9.542
## host_listings_count -2.429e-02 5.359e-03 -4.532
## host_total_listings_count 2.070e-02 2.969e-03 6.971
## host_has_profile_pic -7.940e-01 4.365e-01 -1.819
## host_identity_verified -3.042e-02 1.827e-01 -0.166
## latitude 1.119e+02 3.914e+00 28.582
## longitude 5.159e+01 2.350e+00 21.953
## accommodates 1.256e+00 4.465e-02 28.135
## bathrooms_text 6.523e+00 1.975e-01 33.026
## bedrooms NA NA NA
## beds 1.166e+00 9.146e-02 12.750
## amenities 8.106e-02 4.149e-03 19.534
## minimum_nights 1.787e-02 5.768e-03 3.098
## maximum_nights 3.375e-05 7.113e-05 0.474
## minimum_minimum_nights 3.773e-02 2.976e-02 1.268
## maximum_minimum_nights 6.794e-02 3.372e-02 2.015
## minimum_maximum_nights 1.036e-03 8.486e-04 1.221
## maximum_maximum_nights 5.090e-03 1.646e-03 3.093
## minimum_nights_avg_ntm -1.442e-01 5.259e-02 -2.742
## maximum_nights_avg_ntm -6.126e-03 2.017e-03 -3.037
## has_availability -3.552e+00 4.531e-01 -7.840
## availability_30 1.736e-01 1.414e-02 12.279
## availability_60 -2.588e-02 1.505e-02 -1.720
## availability_90 1.790e-02 8.007e-03 2.236
## availability_365 4.353e-03 5.340e-04 8.152
## number_of_reviews -5.978e-04 1.967e-03 -0.304
## number_of_reviews_ltm -1.074e-02 7.478e-03 -1.437
## number_of_reviews_l30d -2.179e-01 5.385e-02 -4.046
## instant_bookable 5.928e-01 1.353e-01 4.382
## calculated_host_listings_count -4.477e-01 3.884e-02 -11.526
## calculated_host_listings_count_entire_homes 4.788e-01 3.888e-02 12.317
## calculated_host_listings_count_private_rooms 3.066e-01 4.704e-02 6.517
## calculated_host_listings_count_shared_rooms -6.129e-03 1.128e-01 -0.054
## reviews_per_month -7.747e-01 7.145e-02 -10.842
## `room_type_Hotel room` 2.651e+00 1.145e+00 2.314
## `room_type_Private room` -8.956e+00 2.304e-01 -38.881
## `room_type_Shared room` -1.150e+01 8.146e-01 -14.118
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## host_acceptance_rate 5.48e-05 ***
## host_is_superhost < 2e-16 ***
## host_listings_count 5.88e-06 ***
## host_total_listings_count 3.23e-12 ***
## host_has_profile_pic 0.06890 .
## host_identity_verified 0.86777
## latitude < 2e-16 ***
## longitude < 2e-16 ***
## accommodates < 2e-16 ***
## bathrooms_text < 2e-16 ***
## bedrooms NA
## beds < 2e-16 ***
## amenities < 2e-16 ***
## minimum_nights 0.00195 **
## maximum_nights 0.63517
## minimum_minimum_nights 0.20482
## maximum_minimum_nights 0.04390 *
## minimum_maximum_nights 0.22217
## maximum_maximum_nights 0.00198 **
## minimum_nights_avg_ntm 0.00611 **
## maximum_nights_avg_ntm 0.00239 **
## has_availability 4.70e-15 ***
## availability_30 < 2e-16 ***
## availability_60 0.08550 .
## availability_90 0.02536 *
## availability_365 3.77e-16 ***
## number_of_reviews 0.76114
## number_of_reviews_ltm 0.15082
## number_of_reviews_l30d 5.22e-05 ***
## instant_bookable 1.18e-05 ***
## calculated_host_listings_count < 2e-16 ***
## calculated_host_listings_count_entire_homes < 2e-16 ***
## calculated_host_listings_count_private_rooms 7.34e-11 ***
## calculated_host_listings_count_shared_rooms 0.95666
## reviews_per_month < 2e-16 ***
## `room_type_Hotel room` 0.02065 *
## `room_type_Private room` < 2e-16 ***
## `room_type_Shared room` < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.573 on 22632 degrees of freedom
## Multiple R-squared: 0.3416, Adjusted R-squared: 0.3405
## F-statistic: 317.4 on 37 and 22632 DF, p-value: < 2.2e-16
Stepwise on the Box-Cox model
summary(step_modelo_listing_bc)
##
## Call:
## lm(formula = bcprice ~ host_acceptance_rate + host_is_superhost +
## host_listings_count + host_total_listings_count + latitude +
## longitude + accommodates + bathrooms_text + beds + amenities +
## minimum_nights + maximum_maximum_nights + minimum_nights_avg_ntm +
## maximum_nights_avg_ntm + has_availability + availability_30 +
## availability_365 + number_of_reviews_l30d + instant_bookable +
## calculated_host_listings_count + calculated_host_listings_count_entire_homes +
## calculated_host_listings_count_private_rooms + reviews_per_month +
## `room_type_Hotel room` + `room_type_Private room` + `room_type_Shared room`,
## data = listing_df_1_dummies, na.rm = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.904 -5.871 -0.879 5.089 40.235
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 6.892e+03 2.410e+02 28.595
## host_acceptance_rate -1.097e-02 2.652e-03 -4.135
## host_is_superhost 1.240e+00 1.320e-01 9.393
## host_listings_count -2.416e-02 5.356e-03 -4.511
## host_total_listings_count 2.060e-02 2.967e-03 6.942
## latitude 1.115e+02 3.911e+00 28.502
## longitude 5.113e+01 2.341e+00 21.836
## accommodates 1.254e+00 4.464e-02 28.082
## bathrooms_text 6.525e+00 1.974e-01 33.057
## beds 1.165e+00 9.144e-02 12.744
## amenities 8.018e-02 4.097e-03 19.572
## minimum_nights 1.765e-02 5.762e-03 3.063
## maximum_maximum_nights 4.727e-03 1.599e-03 2.957
## minimum_nights_avg_ntm -3.847e-02 6.178e-03 -6.227
## maximum_nights_avg_ntm -4.727e-03 1.599e-03 -2.957
## has_availability -3.543e+00 4.484e-01 -7.901
## availability_30 1.684e-01 5.834e-03 28.862
## availability_365 4.846e-03 4.856e-04 9.979
## number_of_reviews_l30d -2.428e-01 5.140e-02 -4.724
## instant_bookable 5.732e-01 1.348e-01 4.253
## calculated_host_listings_count -4.491e-01 3.460e-02 -12.982
## calculated_host_listings_count_entire_homes 4.799e-01 3.461e-02 13.868
## calculated_host_listings_count_private_rooms 3.074e-01 4.389e-02 7.003
## reviews_per_month -8.183e-01 6.741e-02 -12.139
## `room_type_Hotel room` 2.612e+00 1.127e+00 2.318
## `room_type_Private room` -8.980e+00 2.297e-01 -39.091
## `room_type_Shared room` -1.153e+01 6.982e-01 -16.512
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## host_acceptance_rate 3.56e-05 ***
## host_is_superhost < 2e-16 ***
## host_listings_count 6.48e-06 ***
## host_total_listings_count 3.96e-12 ***
## latitude < 2e-16 ***
## longitude < 2e-16 ***
## accommodates < 2e-16 ***
## bathrooms_text < 2e-16 ***
## beds < 2e-16 ***
## amenities < 2e-16 ***
## minimum_nights 0.00220 **
## maximum_maximum_nights 0.00311 **
## minimum_nights_avg_ntm 4.82e-10 ***
## maximum_nights_avg_ntm 0.00311 **
## has_availability 2.89e-15 ***
## availability_30 < 2e-16 ***
## availability_365 < 2e-16 ***
## number_of_reviews_l30d 2.33e-06 ***
## instant_bookable 2.12e-05 ***
## calculated_host_listings_count < 2e-16 ***
## calculated_host_listings_count_entire_homes < 2e-16 ***
## calculated_host_listings_count_private_rooms 2.58e-12 ***
## reviews_per_month < 2e-16 ***
## `room_type_Hotel room` 0.02045 *
## `room_type_Private room` < 2e-16 ***
## `room_type_Shared room` < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.575 on 22643 degrees of freedom
## Multiple R-squared: 0.3411, Adjusted R-squared: 0.3403
## F-statistic: 450.8 on 26 and 22643 DF, p-value: < 2.2e-16
sf_teste(step_modelo_listing_bc$residuals)
##
## Shapiro-Francia normality test
##
## data: step_modelo_listing_bc$residuals
## W = 0.98678, p-value < 2.2e-16
ols_test_breusch_pagan(step_modelo_listing_bc)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------
## Response : bcprice
## Variables: fitted values of bcprice
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 166.4628
## Prob > Chi2 = 4.383087e-38
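Besides the residuals not adhering to normality (Shapiro-Francia, p-value < 2.2e-16), the Breusch-Pagan test rejects the null hypothesis of constant variance (Chi2 = 166.46, p ≈ 4.4e-38). The residuals are therefore heteroskedastic, which may indicate that variables relevant to explaining Y were omitted from the model.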
export_summs(step_modelo_listing, step_modelo_listing_bc,
model.names = c("Modelo Linear","Modelo Box-Cox"),
scale = F, digits = 6)
## Registered S3 methods overwritten by 'broom':
## method from
## tidy.glht jtools
## tidy.summary.glht jtools
| | Modelo Linear | Modelo Box-Cox |
|---|---|---|
| (Intercept) | 2783283.930766 *** | 6891.580161 *** |
| | (106736.510416) | (241.009163) |
| host_acceptance_rate | -7.174236 *** | -0.010967 *** |
| | (1.174658) | (0.002652) |
| host_is_superhost | 473.219863 *** | 1.240267 *** |
| | (58.477537) | (0.132041) |
| host_listings_count | -12.016408 *** | -0.024163 *** |
| | (2.372168) | (0.005356) |
| host_total_listings_count | 9.278918 *** | 0.020599 *** |
| | (1.314082) | (0.002967) |
| latitude | 44635.301751 *** | 111.459061 *** |
| | (1731.896440) | (3.910592) |
| longitude | 21143.923598 *** | 51.128885 *** |
| | (1036.981220) | (2.341485) |
| accommodates | 520.280855 *** | 1.253501 *** |
| | (19.768464) | (0.044637) |
| bathrooms_text | 3190.209610 *** | 6.525012 *** |
| | (87.417937) | (0.197388) |
| beds | 522.931394 *** | 1.165353 *** |
| | (40.497830) | (0.091443) |
| amenities | 31.952626 *** | 0.080181 *** |
| | (1.814364) | (0.004097) |
| minimum_nights | 7.019189 ** | 0.017648 ** |
| | (2.551827) | (0.005762) |
| maximum_maximum_nights | 1.636109 * | 0.004727 ** |
| | (0.707957) | (0.001599) |
| minimum_nights_avg_ntm | -13.883070 *** | -0.038474 *** |
| | (2.736124) | (0.006178) |
| maximum_nights_avg_ntm | -1.636106 * | -0.004727 ** |
| | (0.707957) | (0.001599) |
| has_availability | -1491.016008 *** | -3.543232 *** |
| | (198.602532) | (0.448441) |
| availability_30 | 64.248017 *** | 0.168383 *** |
| | (2.583720) | (0.005834) |
| availability_365 | 1.775529 *** | 0.004846 *** |
| | (0.215067) | (0.000486) |
| number_of_reviews_l30d | -86.579008 *** | -0.242810 *** |
| | (22.763789) | (0.051400) |
| instant_bookable | 255.337518 *** | 0.573175 *** |
| | (59.685222) | (0.134768) |
| calculated_host_listings_count | -146.397867 *** | -0.449140 *** |
| | (15.322387) | (0.034598) |
| calculated_host_listings_count_entire_homes | 160.431316 *** | 0.479931 *** |
| | (15.326519) | (0.034607) |
| calculated_host_listings_count_private_rooms | 88.273160 *** | 0.307351 *** |
| | (19.437400) | (0.043889) |
| reviews_per_month | -372.799997 *** | -0.818256 *** |
| | (29.852683) | (0.067407) |
| `room_type_Hotel room` | 1527.619752 ** | 2.611745 * |
| | (498.979792) | (1.126688) |
| `room_type_Private room` | -3010.019883 *** | -8.979945 *** |
| | (101.736309) | (0.229719) |
| `room_type_Shared room` | -3670.118672 *** | -11.529063 *** |
| | (309.219994) | (0.698213) |
| N | 22670 | 22670 |
| R2 | 0.297627 | 0.341093 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | | |
I will add the Yhat values from the stepwise and stepwise + Box-Cox models to the dataset for comparison purposes.
listing_df$yhat_step_listing <- step_modelo_listing$fitted.values
listing_df$yhat_step_modelo_bc <- (((step_modelo_listing_bc$fitted.values*(lambda_BC$lambda))+
1))^(1/(lambda_BC$lambda))
listing_df %>%
select(price, yhat_step_listing, yhat_step_modelo_bc) %>%
DT::datatable()
## Warning in instance$preRenderHook(instance): It seems your data is too big for
## client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
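The back-transformation applied to the Box-Cox fitted values can be written out explicitly. This is a sketch that assumes `bcprice` was built with the standard one-parameter Box-Cox transform using `lambda_BC$lambda` (written $\lambda$ here, with $\lambda \neq 0$); the definition of `bcprice` is not shown in this section, so treat the first line as an assumption.

```latex
\begin{aligned}
\text{bcprice} &= \frac{\text{price}^{\lambda} - 1}{\lambda} \\[4pt]
\widehat{\text{price}} &= \left(\lambda \cdot \widehat{\text{bcprice}} + 1\right)^{1/\lambda}
\end{aligned}
```

The second line is what the `yhat_step_modelo_bc` assignment computes. Only after this inversion are the Box-Cox fitted values on the same scale as `price`, which is what makes them directly comparable with `yhat_step_listing`.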
listing_df %>%
ggplot() +
geom_smooth(aes(x = price, y = yhat_step_listing , color = "Stepwise"),
method = "lm", se = F, formula = y ~ splines::bs(x, df = 5), size = 1.5) +
geom_point(aes(x = price, y = yhat_step_listing),
color = "#440154FF", alpha = 0.6, size = 2) +
geom_smooth(aes(x = price, y = yhat_step_modelo_bc, color = "Stepwise Box-Cox"),
method = "lm", se = F, formula = y ~ splines::bs(x, df = 5), size = 1.5) +
geom_point(aes(x = price, y = yhat_step_modelo_bc),
color = "#287D8EFF", alpha = 0.6, size = 2) +
geom_smooth(aes(x = price, y = price), method = "lm", formula = y ~ x,
color = "grey30", size = 1.05,
linetype = "longdash") +
  scale_color_manual("Modelos:",
                     # named values so each line matches the color of its own points
                     values = c("Stepwise" = "#440154FF", "Stepwise Box-Cox" = "#287D8EFF")) +
labs(x = "price", y = "Fitted Values") +
theme(panel.background = element_rect("white"),
panel.grid = element_line("grey95"),
panel.border = element_rect(NA),
legend.position = "bottom")
# Conclusion